ISLR Chapter 3 Exercises

Conceptual

1

To test the null hypothesis (β1 = 0) we determine whether β̂1 is sufficiently far from zero by computing a t-statistic and then the probability, assuming β1 = 0, of observing a value of |t| or greater.

The t-statistic measures the number of standard deviations that β̂1 is away from 0.

The p-value is the probability of observing a value equal to or greater than |t|, assuming β1 = 0.

“Large” t-values can be positive or negative; the farther a value is from 0, the lower the probability of observing |t| or greater by chance.

In practice, a small p-value indicates that the observed association between the predictor and the response is unlikely to be due to chance.

The null hypothesis is H0:β=0 given the other variables in the model, or simply that for a given coefficient β there is no relationship between the corresponding predictor and the target variable.

Furthermore, each p-value in Table 3.4 corresponds to a test of whether the respective coefficient is zero, given the other predictors in the model. The p-values for TV and radio are low, but the one for newspaper is not. One can conclude that only TV and radio are related to sales, i.e. there is an association between the predictors TV and radio and sales, but not between newspaper and sales.
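As an illustration (simulated data, not the Advertising data from the book), the t-statistic and p-value for a slope can be recomputed by hand from the coefficient table returned by summary():

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)  # true beta1 = 3

fit <- lm(y ~ x)
ctab <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-statistic: number of standard deviations bhat1 is away from 0
t_stat <- ctab["x", "Estimate"] / ctab["x", "Std. Error"]

# p-value: probability of observing |t| or greater, assuming beta1 = 0
p_val <- 2 * pt(abs(t_stat), df = fit$df.residual, lower.tail = FALSE)

c(t = t_stat, p = p_val)     # matches the "t value" and "Pr(>|t|)" columns
```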

2

The KNN classifier determines which class a new data point belongs to by assigning it to the most frequent class among the K nearest neighbors of the point.

KNN regression’s target is not a class label but rather a numerical value for f(x) at a prediction point x0. The predicted value is the mean of the responses of the training points in the neighborhood N0.
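A minimal one-dimensional sketch of KNN regression (a hypothetical helper, not the book's implementation): the prediction at x0 is the average response of the K closest training points.

```r
# predict at x0 as the mean response of the K nearest training points
knn_predict <- function(x, y, x0, K = 3) {
  neighbors <- order(abs(x - x0))[1:K]  # indices of the K closest x values
  mean(y[neighbors])                    # average their responses
}

x <- c(1, 2, 3, 10)
y <- c(1.0, 2.1, 2.9, 10.2)
knn_predict(x, y, x0 = 2.5, K = 3)  # mean of the responses at x = 2, 3, 1
```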

3

Multiple linear regression with interaction terms and a dummy variable for the two-level predictor gender:

Female salary: 85 + 20×GPA + 0.07×IQ + 0.01×GPA:IQ − 10×GPA

Male salary: 50 + 20×GPA + 0.07×IQ + 0.01×GPA:IQ

With this choice of dummy variable, males have β̂3×0 = 0 (35×0) and β̂5×0 = 0, so the Gender and GPA×Gender terms drop out of the equation.

3a

For fixed values of IQ and GPA, males will earn more than females when GPA is above 3.5. This is because the female equation has the additional term −10×GPA; once GPA > 3.5, this outweighs the positive term given by β̂3 (35).
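A quick numeric check of the crossover (a sketch; salary is in thousands of dollars and `female` is the dummy variable):

```r
# fitted model: beta0 = 50, GPA = 20, IQ = 0.07, GPA:IQ = 0.01,
#               Gender = 35, GPA:Gender = -10
salary <- function(gpa, iq, female) {
  50 + 20 * gpa + 0.07 * iq + 0.01 * gpa * iq + female * (35 - 10 * gpa)
}

salary(3.4, 110, female = 1) - salary(3.4, 110, female = 0)  # positive: female earns more
salary(3.6, 110, female = 1) - salary(3.6, 110, female = 0)  # negative: male earns more
```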

3b

# Female: 85 + 20×GPA + 0.07×IQ + 0.01×GPA:IQ − 10×GPA

salary = 85 + 20*4.0 + 0.07*110 + 0.01*4*110 - 10*4.0

salary
[1] 137.1

3c

False. The size of the coefficient is not by itself an indication of the interaction effect. Rather, the p-value and the associated t-statistic tell us whether the interaction term is significantly associated with the response.

4

4a

We might expect the RSS to be smaller for the simple linear regression, since the true f(x) is linear; normally, a small RSS indicates a tighter fit of the model to the data.

However, the training RSS decreases as the polynomial order increases, since having more predictors generally means a tighter fit to the training data. We can observe this empirically:


# training args: partition p, sample size n
training <- function(p, n) {
  set.seed(1)
  x <- rnorm(n)
  y <- 5 + 2*x + rnorm(n, sd = 0.5)

  # fit linear and cubic models
  lfit <- lm(y[p] ~ x[p])
  pfit <- lm(y[p] ~ x[p] + I(x[p]^2) + I(x[p]^3))

  # RSE (sigma) from summary()
  sigma_linear <- summary(lfit)$sigma
  sigma_polynomial <- summary(pfit)$sigma

  # RSS: the sum of squared residuals
  rss_lin <- sum(resid(lfit)^2)
  rss_pol <- sum(resid(pfit)^2)

  # display values
  c(linear_RSE = sigma_linear, linear_RSS = rss_lin,
    poly_RSE = sigma_polynomial, poly_RSS = rss_pol)
}

# using subset of data for training RSS
training(1:20, 100)
linear_RSE linear_RSS   poly_RSE   poly_RSS 
 0.4177607  3.1414328  0.4203941  2.8276994 
# using the remaining data
# training(21:100, 100)

If we run 1,000 simulations of the above and review the RSS distribution with a boxplot, we can see that the RSS decreases as the polynomial order increases:

mat <- matrix(0, nrow = 1000, ncol = 5)
for (i in 1:1000) {
  n <- 100
  x <- rnorm(n)
  y <- 5 + 2 * x + rnorm(n, sd = 0.5)  # sd = 0.5 as above (rnorm's second positional arg is the mean)
  for (j in 1:5) {
    # RSS for a polynomial fit of order j
    mat[i, j] <- sum(residuals(lm(y ~ poly(x, j, raw = TRUE)))^2)
  }
}

boxplot(mat, main='RSS for increasing polynomial order',
        xlab='Polynomial order', ylab='RSS',
        col=(c("red","orange","gold",
               "yellow", "lightyellow")))

4b

Refer to 4a

4c

We’d expect the training RSS to be lower for the cubic regression, since higher-order polynomials fit the training data more closely: a higher number of predictors gives a tighter fit.

5

Using equation (3.4) on page 62, β̂0 = ȳ − β̂1x̄; therefore when xi = x̄ the fitted value is ŷi = β̂0 + β̂1x̄ = ȳ. In other words, the least squares line always passes through the point (x̄, ȳ).
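This can be checked numerically on simulated data (an illustrative sketch):

```r
set.seed(42)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# the least squares line passes through (mean(x), mean(y))
yhat_at_xbar <- predict(fit, data.frame(x = mean(x)))
all.equal(unname(yhat_at_xbar), mean(y))  # TRUE
```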

Applied

8

library(ISLR)
attach(Auto)
head(Auto)

Fitting a linear regression model using lm()

regr.fit <- lm(mpg~horsepower, data=Auto); summary(regr.fit)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

8a: i-iii

From the summary we can see a large t-statistic with a correspondingly low (significant) p-value, hence there is a relationship between predictor and response. The coefficient for horsepower is negative, implying that mpg decreases as horsepower increases.

8a: iv

According to our model: mpg = 39.9359 − 0.1578×horsepower

Plugging in the numbers: mpg = 39.9359 − 0.1578×98 ≈ 24.47

Using predict():

predict(regr.fit, data.frame(horsepower = (c(98))), interval = "confidence")
       fit      lwr      upr
1 24.46708 23.97308 24.96108
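The exercise also asks for the 95% prediction interval; using the same regr.fit, predict() returns it with interval = "prediction". It is wider than the confidence interval because it also accounts for the irreducible error of a single observation:

```r
# prediction interval for an individual car with horsepower = 98
predict(regr.fit, data.frame(horsepower = 98), interval = "prediction")
```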

8b

plot(mpg~horsepower, data=Auto, main='mpg ~ horsepower')
abline(regr.fit, lwd=3, col='tomato')

par(mfrow=c(2,2))
plot(regr.fit)

The residuals of the simple linear regression model are the differences between the observed values of the dependent variable y and the fitted values ŷ, i.e. the i-th residual is yi − ŷi.

Here the residuals show a clear pattern, indicating non-linearity. Point 334 could be an outlier.

plot(regr.fit, which=1, col="cornflowerblue", pch='+')
abline(0, 0)

Alternative, more graphical ways to visualise residuals are possible.

Step 1

With the model fitted, we store residuals and predicted values in two new variables:

Auto.copy <- Auto # copy Auto in another df for this
Auto.copy$predicted <- predict(regr.fit) # adds predicted to the copied df
Auto.copy$residuals <- residuals(regr.fit)
# initially I had added these columns to Auto itself; this is why I drop them in Q9 below


# Quick look at the actual, predicted and residual values
# install.packages("dplyr")
library(dplyr)
Auto.copy %>% select(mpg, predicted, residuals) %>% head()

Step 2

Plot the actual and predicted values.

First, plot the actual data:

# install.packages("ggplot2")
library(ggplot2) # using ggplot style
ggplot(Auto.copy, aes(x=horsepower, y=mpg))+ # set up the canvas
  geom_point() # plot the actual points

Next, we plot the predicted value from our regression - we want these to be distinguishable from the actuals.

ggplot(Auto.copy, aes(x=horsepower, y=mpg))+ # set up the canvas
  geom_point()+# plot the actual points
  geom_point(aes(y=predicted), shape=1) # add the predicted values

To show how actual and predicted values are related we can connect them using geom_segment():

ggplot(Auto.copy, aes(x=horsepower, y=mpg))+ # set up the canvas
  geom_point()+# plot the actual points
  geom_point(aes(y=predicted), shape=1)+
  geom_segment(aes(xend=horsepower, yend=predicted))

Further adjustments:

  • Clean up the look with theme_bw()
  • Fade out connection lines by adjusting their alpha
  • Add our regression line with geom_smooth()

library(ggplot2)
ggplot(Auto.copy, aes(x = horsepower, y = mpg))+
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey")+  # Plot regression slope
  geom_segment(aes(xend = horsepower, yend = predicted), alpha = .2)+  # alpha to fade lines
  geom_point()+
  geom_point(aes(y = predicted), shape = 1)+
  theme_bw()  # Add theme for cleaner look

Step 3

Now we want to make an adjustment to highlight the size of the residuals.

We can visualise the relationship between actual and predicted values with the help of colors. Using colors to highlight where actual values sit above or below the predictions helps identify non-linearity in the data.

For example, we can see more red for extreme values of horsepower (close to the lower and upper bounds of the range, i.e. < 75, > 150), indicating actual values are greater than what is being predicted. For central values (75 < x < 150) we see more blue, indicating the actuals are less than what is being predicted.

# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(Auto.copy, aes(x=horsepower, y=mpg))+
  geom_smooth(method="lm", se=FALSE, color="lightgrey")+
  geom_segment(aes(xend=horsepower, yend=predicted), alpha=.2)+

  # > Color adjustments made here...
  geom_point(aes(color=residuals))+  # Color mapped here
  scale_color_gradient2(low="blue", mid="white", high="red")+  # Colors to use here
  guides(color = "none")+ # Color legend removed (logical FALSE is deprecated)
  # <

  geom_point(aes(y=predicted), shape=1)+
  theme_bw()

9

# dropping predicted and residual cols that were added earlier
# head(Auto)

drops <- c("predicted","residuals") # cols to drop
Auto <- Auto[, !(names(Auto) %in% drops)] # re-framing 

head(Auto)

9a

# plot(mpg~., data=Auto, main='mpg ~ all')

pairs(Auto, pch=21) # [1:4] would select a subset of the columns


# pairs(Auto[1:4], pch=21)
# pairs(Auto[5:8])

9b

drops <- c("name") # dropping name as qualitative
Auto.cor <- Auto[, !(names(Auto) %in% drops)] 


cor(Auto.cor)
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000
# could also do
# cor(subset(Auto, select=-name)) # applies func to all minus name

9c

mregr.fit <- lm(mpg~. -name, data=Auto); summary(mregr.fit)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

9c: i-iii

i. The F-statistic is large with an associated low (significant) p-value, so we can say that overall there is a relationship between the predictors and the response.

  • F-test in regression compares the fits of different linear models. Unlike t-tests that can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously.

  • The F-test of the overall significance is a specific form of the F-test. It compares a model with no predictors to the model that you specify. A regression model that contains no predictors is also known as an intercept-only model.

  • The hypotheses for the F-test of the overall significance are as follows:

    • Null hypothesis: The fit of the intercept-only model and your model are equal.
    • Alternative hypothesis: Your model fits the data significantly better than the intercept-only model.

If the p-value for the F-test of overall significance is less than your significance level, you can reject the null hypothesis and conclude that your model provides a better fit than the intercept-only model.
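In R, this comparison can be made explicit (a sketch using the model fitted above): calling anova() on the intercept-only model versus mregr.fit reproduces the overall F-statistic reported by summary():

```r
null.fit <- lm(mpg ~ 1, data = Auto)  # intercept-only model
anova(null.fit, mregr.fit)            # F should match the summary() F-statistic
```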


ii. Displacement, weight, year and origin have p-values close to zero, i.e. a statistically significant relationship with the response.

iii. The year coefficient is positive, indicating a positive slope for the predictor. In practice we can infer that mpg increases the newer the car is. We can observe this trend if we plot mpg against year:

plot(mpg~year, data=Auto)

9d

Plotting the diagnostic plots with the plot() function:

par(mfrow=c(2,2))
plot(mregr.fit)

Detail for residuals and studentized residuals (leverage):

plot(mregr.fit, which=1, col="cornflowerblue", pch='+')

plot(mregr.fit, which=5, col="cornflowerblue", pch='+')
abline(0, 0)

The residual plot shows outliers for points 326, 323, 327.

The data seem to exhibit heteroscedasticity (non-constant variance in the errors), whereby the magnitude of the residuals increases with the fitted values. This is indicated visually by a funnel shape.

The leverage plot identifies anomalous values of the predictors relative to the response. For example, we observe high leverage at index 14: the predicted mpg for the horsepower value of 225 is 18.88, but the actual mpg value is 14, hence a residual of −4.88.

Other points in the plot exhibit similar features, for example at index 6 (please refer to table below).

mmAuto.copy <- Auto # copying the df to inspect residuals
mmAuto.copy$predicted <- predict(mregr.fit)  # adding columns
mmAuto.copy$residuals <- residuals(mregr.fit)

mmAuto.copy %>% select(horsepower, mpg, predicted, residuals) %>% head(20)
# mmAuto.copy[14, ] # shows only row at index 14 (the identified leverage point)
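Leverage can also be inspected directly (a sketch using the fit above): hatvalues() returns the leverage statistic for each observation, and points well above the average leverage (p + 1)/n deserve attention.

```r
h <- hatvalues(mregr.fit)
avg_leverage <- length(coef(mregr.fit)) / nrow(Auto)  # (p + 1)/n
head(sort(h, decreasing = TRUE))  # highest-leverage observations
```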

9e

Selecting four predictors at a time and regressing mpg on them, with and without interaction terms.

iregr.fit_a <- lm(mpg ~ weight +displacement +year +origin
                -name -acceleration,
                data=Auto)
iregr.fit_a1 <- lm(mpg ~ weight*displacement +year +origin
                -name -acceleration,
                data=Auto)
iregr.fit_a2 <- lm(mpg ~  +weight +displacement +year*origin
                 -name -acceleration, 
                 data=Auto)

summary(iregr.fit_a)

Call:
lm(formula = mpg ~ weight + displacement + year + origin - name - 
    acceleration, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8102 -2.1129 -0.0388  1.7725 13.2085 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.861e+01  4.028e+00  -4.620 5.25e-06 ***
weight       -6.575e-03  5.571e-04 -11.802  < 2e-16 ***
displacement  5.588e-03  4.768e-03   1.172    0.242    
year          7.714e-01  4.981e-02  15.486  < 2e-16 ***
origin        1.226e+00  2.670e-01   4.593 5.92e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.346 on 387 degrees of freedom
Multiple R-squared:  0.8181,    Adjusted R-squared:  0.8162 
F-statistic: 435.1 on 4 and 387 DF,  p-value: < 2.2e-16
summary(iregr.fit_a1)

Call:
lm(formula = mpg ~ weight * displacement + year + origin - name - 
    acceleration, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6119  -1.7290  -0.0115   1.5609  12.5584 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -8.007e+00  3.798e+00  -2.108   0.0357 *  
weight              -1.054e-02  6.530e-04 -16.146  < 2e-16 ***
displacement        -7.148e-02  9.176e-03  -7.790 6.27e-14 ***
year                 8.194e-01  4.518e-02  18.136  < 2e-16 ***
origin               3.567e-01  2.574e-01   1.386   0.1666    
weight:displacement  2.104e-05  2.214e-06   9.506  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.016 on 386 degrees of freedom
Multiple R-squared:  0.8526,    Adjusted R-squared:  0.8507 
F-statistic: 446.5 on 5 and 386 DF,  p-value: < 2.2e-16
summary(iregr.fit_a2)

Call:
lm(formula = mpg ~ +weight + displacement + year * origin - name - 
    acceleration, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7541 -1.8722 -0.0936  1.6900 12.4650 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.927e+00  8.873e+00   0.893 0.372229    
weight       -6.394e-03  5.526e-04 -11.571  < 2e-16 ***
displacement  1.551e-03  4.859e-03   0.319 0.749735    
year          4.313e-01  1.130e-01   3.818 0.000157 ***
origin       -1.449e+01  4.707e+00  -3.079 0.002225 ** 
year:origin   2.023e-01  6.047e-02   3.345 0.000904 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.303 on 386 degrees of freedom
Multiple R-squared:  0.8232,    Adjusted R-squared:  0.8209 
F-statistic: 359.5 on 5 and 386 DF,  p-value: < 2.2e-16
iregr.fit_b <- lm(mpg ~ horsepower +cylinders +weight +displacement 
                  -name -acceleration,
                  data=Auto)
iregr.fit_b1 <- lm(mpg ~ horsepower*cylinders +weight +displacement 
                   -name -acceleration,
                   data=Auto)
iregr.fit_b2 <- lm(mpg ~ horsepower +cylinders +weight*displacement 
                   -name -acceleration,
                   data=Auto)

summary(iregr.fit_b)

Call:
lm(formula = mpg ~ horsepower + cylinders + weight + displacement - 
    name - acceleration, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5248  -2.7964  -0.3568   2.2577  16.3221 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.7567705  1.5200437  30.102  < 2e-16 ***
horsepower   -0.0428125  0.0128699  -3.327 0.000963 ***
cylinders    -0.3932854  0.4095522  -0.960 0.337513    
weight       -0.0052772  0.0007166  -7.364 1.08e-12 ***
displacement  0.0001389  0.0090099   0.015 0.987709    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.242 on 387 degrees of freedom
Multiple R-squared:  0.7077,    Adjusted R-squared:  0.7046 
F-statistic: 234.2 on 4 and 387 DF,  p-value: < 2.2e-16
summary(iregr.fit_b1)

Call:
lm(formula = mpg ~ horsepower * cylinders + weight + displacement - 
    name - acceleration, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9525  -2.2068  -0.3781   1.8965  16.3082 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          68.8806613  3.0114028  22.873  < 2e-16 ***
horsepower           -0.3328210  0.0355014  -9.375  < 2e-16 ***
cylinders            -4.3199327  0.5885321  -7.340 1.27e-12 ***
weight               -0.0033373  0.0006937  -4.811 2.16e-06 ***
displacement         -0.0149953  0.0084380  -1.777   0.0763 .  
horsepower:cylinders  0.0411987  0.0047570   8.661  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.886 on 386 degrees of freedom
Multiple R-squared:  0.7552,    Adjusted R-squared:  0.752 
F-statistic: 238.2 on 5 and 386 DF,  p-value: < 2.2e-16
summary(iregr.fit_b2)

Call:
lm(formula = mpg ~ horsepower + cylinders + weight * displacement - 
    name - acceleration, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4271  -2.4824  -0.4691   1.9872  15.9704 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          5.772e+01  2.110e+00  27.358  < 2e-16 ***
horsepower          -6.779e-02  1.244e-02  -5.449 9.06e-08 ***
cylinders            9.741e-02  3.874e-01   0.251    0.802    
weight              -9.082e-03  8.330e-04 -10.903  < 2e-16 ***
displacement        -7.751e-02  1.317e-02  -5.885 8.66e-09 ***
weight:displacement  2.175e-05  2.840e-06   7.658 1.54e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.957 on 386 degrees of freedom
Multiple R-squared:  0.7462,    Adjusted R-squared:  0.7429 
F-statistic:   227 on 5 and 386 DF,  p-value: < 2.2e-16

9f

Trying quadratic, cubic, log and square-root transformations of horsepower.

regr.fit_poly2 <- lm(mpg ~ horsepower +I(horsepower^2)
                     +weight +year,data=Auto)
regr.fit_poly3 <- lm(mpg ~ poly(horsepower, 3) # using poly includes 1,2,3 deg
                     +weight +year, data=Auto) 
regr.fit_log <- lm(mpg ~ horsepower +I(log(horsepower)) 
                   +weight +year, data=Auto) 
regr.fit_sqrt <- lm(mpg ~ horsepower +I(sqrt(horsepower))
                    +weight +year, data=Auto)

summary(regr.fit_poly2)

Call:
lm(formula = mpg ~ horsepower + I(horsepower^2) + weight + year, 
    data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5686 -2.0271 -0.0978  1.8941 13.0043 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -3.5836719  3.9432463  -0.909    0.364    
horsepower      -0.2465520  0.0274476  -8.983   <2e-16 ***
I(horsepower^2)  0.0008632  0.0000932   9.261   <2e-16 ***
weight          -0.0051121  0.0003975 -12.860   <2e-16 ***
year             0.7543108  0.0472204  15.974   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.108 on 387 degrees of freedom
Multiple R-squared:  0.8431,    Adjusted R-squared:  0.8415 
F-statistic: 519.9 on 4 and 387 DF,  p-value: < 2.2e-16
summary(regr.fit_poly3)

Call:
lm(formula = mpg ~ poly(horsepower, 3) + weight + year, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1690 -1.8868  0.0949  1.6838 12.6012 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -1.829e+01  3.576e+00  -5.115 4.96e-07 ***
poly(horsepower, 3)1 -1.707e+01  6.782e+00  -2.517   0.0122 *  
poly(horsepower, 3)2  2.977e+01  3.264e+00   9.118  < 2e-16 ***
poly(horsepower, 3)3 -1.434e+01  3.100e+00  -4.624 5.14e-06 ***
weight               -5.500e-03  3.965e-04 -13.874  < 2e-16 ***
year                  7.649e-01  4.608e-02  16.599  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.029 on 386 degrees of freedom
Multiple R-squared:  0.8513,    Adjusted R-squared:  0.8494 
F-statistic: 442.1 on 5 and 386 DF,  p-value: < 2.2e-16
summary(regr.fit_sqrt)

Call:
lm(formula = mpg ~ horsepower + I(sqrt(horsepower)) + weight + 
    year, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3718 -1.8843  0.0377  1.7125 12.8009 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         33.0384452  5.9518311   5.551 5.28e-08 ***
horsepower           0.4040627  0.0414611   9.746  < 2e-16 ***
I(sqrt(horsepower)) -9.3358246  0.9265639 -10.076  < 2e-16 ***
weight              -0.0052515  0.0003833 -13.701  < 2e-16 ***
year                 0.7605633  0.0464642  16.369  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.057 on 387 degrees of freedom
Multiple R-squared:  0.8482,    Adjusted R-squared:  0.8466 
F-statistic: 540.4 on 4 and 387 DF,  p-value: < 2.2e-16
summary(regr.fit_log)

Call:
lm(formula = mpg ~ horsepower + I(log(horsepower)) + weight + 
    year, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2256 -1.8242  0.0084  1.7276 12.7952 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         7.265e+01  9.284e+00   7.825 4.89e-14 ***
horsepower          1.853e-01  2.053e-02   9.023  < 2e-16 ***
I(log(horsepower)) -2.411e+01  2.375e+00 -10.154  < 2e-16 ***
weight             -5.339e-03  3.799e-04 -14.053  < 2e-16 ***
year                7.630e-01  4.640e-02  16.445  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.052 on 387 degrees of freedom
Multiple R-squared:  0.8486,    Adjusted R-squared:  0.8471 
F-statistic: 542.5 on 4 and 387 DF,  p-value: < 2.2e-16
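To compare the four transformed fits side by side (a sketch; higher adjusted R² and lower RSE indicate a tighter in-sample fit):

```r
fits <- list(poly2 = regr.fit_poly2, poly3 = regr.fit_poly3,
             sqrt = regr.fit_sqrt, log = regr.fit_log)
sapply(fits, function(f) c(adj_R2 = summary(f)$adj.r.squared,
                           RSE = summary(f)$sigma))
```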

10

attach(Carseats)
head(Carseats)
fit.lm <- lm(Sales ~ Price +Urban +US,
             data=Carseats)
summary(fit.lm)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Parameters * Sales: sales in thousands at each location * Price: price charged for car seats at each location * Urban: No/Yes by location * US: No/Yes by location

Coefficients comments

10b

Sales = 13.043 - 0.054 x Price - 0.022 x UrbanYes + 1.201 x USYes


10c

We can reject the null hypothesis for Price and USYes (low p-values).

10d

fit.lm1 <- lm(Sales ~ Price +US,
              data=Carseats)
summary(fit.lm1)

Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
# install.packages("plotly")
library(plotly)

# US in Carseats is already a Yes/No factor, so no recoding is needed
Carseats$US <- as.factor(Carseats$US)

p <- plot_ly(Carseats, x = ~Price, y = ~Advertising, z = ~Sales, color = ~US,
             colors = c('#BF382A', '#0C4B8E')) %>%
  
  add_markers() %>%
  
  layout(scene = list(xaxis = list(title = 'Price'),
                     yaxis = list(title = 'Advertising'),
                     zaxis = list(title = 'Sales')))

p

10f

  • fit.lm (Price, Urban, US): RSE = 2.472 R^2 = 0.2393
  • fit.lm1 (Price, US): RSE = 2.469 R^2 = 0.2393

fit.lm1 has a slightly better (lower) RSE value and one fewer predictor variable.
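Since fit.lm1 is nested in fit.lm (it simply drops Urban), a partial F-test confirms that Urban adds nothing significant:

```r
anova(fit.lm1, fit.lm)  # large p-value: dropping Urban is justified
```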

10g

confint(fit.lm1)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

10h

par(mfrow=c(2,2))
# residuals v fitted plot doesn't show strong outliers
plot(fit.lm1) 

---
title: "ISLR Chapter 3 Exercises"
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
 encoding=encoding,
 output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
output: html_notebook
---

# *Conceptual*

# 1

To test the null hypotesis (b1 = 0) we determine whether bhat1 is sufficiently far from zero by computing a t-statistic and computing the probability to observe any value equal to abs(t) or greater, assuming b1 = 0.

The t-statistic measures the numbers of standard deviations that bhat1 is away from 0.

p-value is the computed probability to observe values equal to abs(t), assuming b1 = 0

"Large" t-values can be positive or negative, and the farther from 0 a value is the lower the probability to observe abs(t) or greater by chance.

In practice, a small p-value indicates it is unlikely to observe a substantial association between the predictor and the response due to chance. 

The null hypothesis is **H0:β=0** given the other variables in the model, or simply that for any coefficient β there isn’t a relationship between β and the target variable.

Furthermore, each p-value in Table 3.4 corresponds to the probability of the respective coefficient being zero. Both p-values for TV and radio are low, however the on for newspaper is not. one can conclude that only TV and radio are related to sales, i.e. there is association between the predictors TV and radio and sales but not newspapers and sales.


# 2

The KNN classifier dtermines which class a new data point belongs to by assigning it to the class of higher frequency in the K-neighborhood of the point. 

KNN regression's target is not a class label but rather a numerical value for f(x) given a prediction point x0. The new point value is estimated as the mean of the points in the neighborhood N0.


# 3

* *Female salary*: 85 + 20×GPA + 0.07×IQ + 0.01×GPA:IQ − 10×GPA

* *Male salary*: 50 + 20×GPA + 0.07×IQ + 0.01×GPA:IQ

Multiple predictors linear regression with interction terms and dummy variables for 2 level predictor (gender).

With this choice of dummy variable males have β^3x0 = 0 (35x0) and β^5x0 = 0 so GPAxGender and coeff.xGender are removed from the euquation. 

#### 3a
For a fixed value of IQ and GPA, males will earn more than female when GPA is above 3.5. This is because the Female equation has the term -10xGPA, and once GPA>3.5 this will cancel the positive term given by β^3 (35).

#### 3b
```{r}
# Female: 85 + 20×GPA + 0.07×IQ + 0.01×GPA:IQ − 10×GPA

salary = 85 + 20*4.0 + 0.07*110 + 0.01*4*110 - 10*4.0

salary
```

#### 3c
False. The size of the coeefiecient is not an indication of the interaction effect per se. Rather, the p-value and relative t-statistic can tell if interacted predictor terms are significantly associated with the response.


# 4

#### 4a
We could think the RSS to be smaller for the simple linear regression since the true f(x) is linear. Normally, a small RSS indicates a tighter fit of the model to the data.

However the RSS decreases as polynomial order increases, since having more predictors normally means a tighter fit. We can observe this empirically:

```{r}

# training args: partition, sample size
training <- function(p, n) {
set.seed(1)
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, sd = 0.5)

# fitting linear and cubic models
lfit <- lm(y[p] ~ x[p])
pfit <- lm(y[p] ~ x[p] + I(x[p]^2) + I(x[p]^3))

# finding RSE (sigma) with summary func
sigma_linear <- summary(lfit)$sigma
sigma_polynomial <- summary(pfit)$sigma

# finding RSS - done by summing the residuals of squares
rss_lin <- sum(resid(lfit)^2)
rss_pol <- sum(resid(pfit)^2)

# display values
c(linear_RSE = sigma_linear, linear_RSS = rss_lin, 
  poly_RSE = sigma_polynomial, poly_RSS = rss_pol)
}

# using subset of data for training RSS
training(1:20, 100)

# using the remain data
# training(20:100, 100)

```

If we run 1,000 simulations on the above for RSS and review the distribution with a boxplot we can see RSS is lower the higher the order of the polynomial:

```{r}
mat <- matrix(0, nrow = 1000, ncol = 5)
for (i in 1:1000) {
  n <- 100
  x <- rnorm(n)
  y <- 5 + 2 * x + rnorm(n, sd = 0.5)
  for (j in 1:5) {
    # training RSS for the degree-j polynomial fit
    mat[i, j] <- sum(residuals(lm(y ~ poly(x, j, raw = TRUE)))^2)
  }
}

boxplot(mat, main = 'RSS for increasing polynomial order',
        xlab = 'Polynomial order', ylab = 'RSS',
        col = c("red", "orange", "gold",
                "yellow", "lightyellow"))

```

#### 4b
4b asks about the *test* RSS. Since the true relationship is linear, we would expect the test RSS to be lower for the simple linear regression: the cubic model's extra flexibility fits noise in the training data (overfitting), which tends to increase its error on unseen data.
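A sketch of training vs. test RSS on simulated linear data (assumed setup, mirroring the simulation above): the cubic fit's training RSS is never larger than the linear fit's, while on held-out data the simpler linear model typically does better.

```{r}
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, sd = 0.5)
train <- 1:100; test <- 101:200

lfit <- lm(y ~ x, subset = train)
cfit <- lm(y ~ poly(x, 3), subset = train)

# RSS of a fit evaluated on the observations indexed by idx
rss <- function(fit, idx) sum((y[idx] - predict(fit, data.frame(x = x[idx])))^2)

c(train_lin = rss(lfit, train), train_cub = rss(cfit, train),
  test_lin  = rss(lfit, test),  test_cub  = rss(cfit, test))
```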

#### 4c
We'd expect the training RSS to be lower for the cubic regression, since higher-order polynomials fit the training data more closely: a higher number of predictors equals a tighter fit, whatever the true form of f(X).

# 5
Using equation (3.4) on page 62, β0^ = ȳ − β1^x̄, so evaluating the fitted line at xi = x̄ gives yi^ = β0^ + β1^x̄ = ȳ. The least squares line therefore always passes through the point (x̄, ȳ).
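This can be checked numerically on simulated data (any lm fit would do):

```{r}
set.seed(1)
x <- rnorm(50)
y <- 3 + 4*x + rnorm(50)
fit <- lm(y ~ x)
# the fitted value at x-bar equals y-bar (difference is essentially zero)
predict(fit, data.frame(x = mean(x))) - mean(y)
```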





# *Applied*

# 8
```{r message=FALSE, warning=FALSE}
library(ISLR)
attach(Auto)
head(Auto)
```

Fitting a linear regression model using lm()
```{r}
regr.fit <- lm(mpg~horsepower, data=Auto); summary(regr.fit)
```
#### 8a: i-iii
From the summary we can see a large *t-statistic* with a correspondingly low (significant) *p-value*, hence there is a relationship between predictor and response. The *coefficient* for horsepower is negative, implying a negative (decreasing) relationship with mpg.

#### 8a: iv
According to our model:
mpg = 39.9358 + (-0.1578 × horsepower)

Plugging in the numbers:
mpg = 39.9358 - 0.1578 × 98 ≈ 24.47

Using predict(): 
```{r}
predict(regr.fit, data.frame(horsepower = (c(98))), interval = "confidence")
```
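The exercise also asks for the corresponding prediction interval, which is wider than the confidence interval because it additionally covers the irreducible error of an individual response:

```{r}
# Prediction interval for the same point (regr.fit as fitted above)
predict(regr.fit, data.frame(horsepower = 98), interval = "prediction")
```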

#### 8b
```{r}
plot(mpg~horsepower, data=Auto, main='mpg ~ horsepower')
abline(regr.fit, lwd=3, col='tomato')
```

```{r}
par(mfrow=c(2,2))
plot(regr.fit)
```

The residuals of a simple linear regression are the differences between the observed values of the response y and the fitted values ŷ, i.e. the *i*-th residual is yᵢ - ŷᵢ.
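This identity is easy to verify on a small simulated fit:

```{r}
set.seed(1)
x <- rnorm(30)
y <- 1 + 2*x + rnorm(30)
fit <- lm(y ~ x)
# residuals() returns exactly y minus the fitted values
all.equal(unname(residuals(fit)), y - unname(fitted(fit)))  # TRUE
```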

Here the residuals show a pattern of non-linearity.
Point 334 could be an outlier.

```{r}
plot(regr.fit, which=1, col="cornflowerblue", pch='+')
abline(0, 0)
```




**Alternative, more graphical ways to visualise residuals are possible.**

##### *Step 1*
With the model fitted, we store residuals and predicted values in two new variables:

```{r message=FALSE, warning=FALSE}
Auto.copy <- Auto # copy Auto in another df for this
Auto.copy$predicted <- predict(regr.fit) # adds predicted to the copied df
Auto.copy$residuals <- residuals(regr.fit)
# initially I had Auto$predicted. For this reason I use drop in Q9 below


# Quick look at the actual, predicted and residual values
# install.packages("dplyr")
library(dplyr)
Auto.copy %>% select(mpg, predicted, residuals) %>% head()
```

##### *Step 2*
Plot the actual and predicted values.

First, plot the actual data:
```{r message=TRUE, warning=TRUE}
# install.packages("ggplot2")
library(ggplot2) # using ggplot style
ggplot(Auto.copy, aes(x=horsepower, y=mpg))+ # set up the canvas
  geom_point() # plot the actual points
```

Next, we plot the predicted value from our regression - we want these to be distinguishable from the actuals.

```{r}
ggplot(Auto.copy, aes(x=horsepower, y=mpg))+ # set up the canvas
  geom_point()+# plot the actual points
  geom_point(aes(y=predicted), shape=1) # add the predicted values
```

To show how actual and predicted values are related we can connect them using *geom_segment()*:

```{r}
ggplot(Auto.copy, aes(x=horsepower, y=mpg))+ # set up the canvas
  geom_point()+# plot the actual points
  geom_point(aes(y=predicted), shape=1)+
  geom_segment(aes(xend=horsepower, yend=predicted))
```

Further adjustment to:
* Clean up the look with *theme_bw()*
* Fade out connection lines by adjusting their *alpha*
* Add our regression line with *geom_smooth()* 

```{r}
library(ggplot2)
ggplot(Auto.copy, aes(x = horsepower, y = mpg))+
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey")+  # Plot regression slope
  geom_segment(aes(xend = horsepower, yend = predicted), alpha = .2)+  # alpha to fade lines
  geom_point()+
  geom_point(aes(y = predicted), shape = 1)+
  theme_bw()  # Add theme for cleaner look
```

##### *Step 3*
Now we want to make an adjustment to highlight the size of the residuals.

We can visualise the relationship between actual and predicted values with the help of colors. Using color to highlight where the actuals fall above or below the predictions helps identify non-linearity in the data.

For example, we see more red at extreme values of horsepower (near the lower and upper bounds of the range, i.e. < 75 or > 150), indicating the actual values are *greater* than predicted. For central values (75 < x < 150) we see more blue, indicating the actuals are *less* than predicted.

```{r}
# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(Auto.copy, aes(x=horsepower, y=mpg))+
  geom_smooth(method="lm", se=FALSE, color="lightgrey")+
  geom_segment(aes(xend=horsepower, yend=predicted), alpha=.2)+

  # > Color adjustments made here...
  geom_point(aes(color=residuals))+  # Color mapped here
  scale_color_gradient2(low="blue", mid="white", high="red")+  # Colors to use here
  guides(color = "none")+ # Color legend removed
  # <

  geom_point(aes(y=predicted), shape=1)+
  theme_bw()
```

# 9
```{r}
# dropping predicted and residual cols that were added earlier
# head(Auto)

drops <- c("predicted","residuals") # cols to drop
Auto <- Auto[, !(names(Auto) %in% drops)] # re-framing 

head(Auto)

```

#### 9a

```{r}
# plot(mpg~., data=Auto, main='mpg ~ all')

pairs(Auto, pch=21) # [1:4] would select a subset of the columns

# pairs(Auto[1:4], pch=21)
# pairs(Auto[5:8])
```

#### 9b

```{r}
drops <- c("name") # dropping name as qualitative
Auto.cor <- Auto[, !(names(Auto) %in% drops)] 


cor(Auto.cor)

# could also do
# cor(subset(Auto, select=-name)) # applies func to all minus name
```

#### 9c

```{r}
mregr.fit <- lm(mpg~. -name, data=Auto); summary(mregr.fit)
```

#### 9c: i-iii

*i.* The F-statistic is large with an associated low (significant) p-value, so we can say that overall there is a relationship between the predictors and the response. 

  * F-test in regression compares the fits of different linear models. Unlike t-tests that can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously.

  * The F-test of the overall significance is a specific form of the F-test. It compares a model with no predictors to the model that you specify. A regression model that contains no predictors is also known as an intercept-only model.
  
  * The hypotheses for the F-test of the overall significance are as follows:

    * **Null hypothesis**: The fit of the intercept-only model and your model are equal.
    * **Alternative hypothesis**: Your model fits the data significantly better than the intercept-only model.

*** 
If the p-value for the F-test of overall significance is less than your significance level, you can reject the null hypothesis and conclude that your model provides a better fit than the intercept-only model.
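This equivalence can be sketched directly in R (simulated data, not the Auto fit): the F statistic obtained by comparing the intercept-only model with the full model via anova() matches the overall F reported by summary().

```{r}
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2*d$x1 + rnorm(100)

null_fit <- lm(y ~ 1, data = d)        # intercept-only model
full_fit <- lm(y ~ x1 + x2, data = d)  # model with predictors

anova(null_fit, full_fit)$F[2]         # F from the model comparison
summary(full_fit)$fstatistic["value"]  # overall F from summary()
```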

***
*ii.* Displacement, weight, year and origin have p-values close to zero, i.e. statistically significant relationships with the response.

*iii.* The *year* coefficient is positive, indicating a positive slope for the predictor. In practice we can infer that mpg increases the newer the car is. We can observe this trend by plotting mpg against year:

```{r}
plot(mpg~year, data=Auto)
```

#### 9d

Plotting the diagnostic plots with the plot() function:

```{r}
par(mfrow=c(2,2))
plot(mregr.fit)
```

Detail for residuals and studentized residuals (leverage):

```{r}
plot(mregr.fit, which=1, col="cornflowerblue", pch='+')
plot(mregr.fit, which=5, col="cornflowerblue", pch='+')
abline(0, 0)
```

The *residual plot* shows possible outliers at points 323, 326 and 327.

The data seem to exhibit heteroscedasticity (non-constant variance of the errors), whereby the magnitude of the residuals increases with the fitted values. This is indicated visually by a funnel shape.

The *leverage plot* identifies observations with unusual predictor values relative to the rest of the data.
For example, we observe high leverage at index 14: the predicted mpg for a horsepower value of 225 is 18.88, whereas the actual mpg is 14, giving a residual of -4.88.

Other points in the plot exhibit similar features, for example index 6 (please refer to the table below).

```{r}
mmAuto.copy <- Auto # copying the df to inspect residuals
mmAuto.copy$predicted <- predict(mregr.fit)  # adding columns
mmAuto.copy$residuals <- residuals(mregr.fit)

mmAuto.copy %>% select(horsepower, mpg, predicted, residuals) %>% head(20)
# mmAuto.copy[14, ] # shows only row at index 14 (the identified leverage point)
```
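Leverage can also be read off directly from the hat values; observations well above the average leverage (p + 1)/n deserve a closer look. A self-contained sketch where one observation has an extreme predictor value:

```{r}
set.seed(1)
x <- c(rnorm(29), 6)       # last point has an extreme x value
y <- 1 + 2*x + rnorm(30)
fit <- lm(y ~ x)
which.max(hatvalues(fit))  # the extreme-x point has the highest leverage
mean(hatvalues(fit))       # average leverage (p + 1)/n = 2/30
```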

#### 9e

Selecting 4 predictors and regressing on each plus interactions.

```{r}
iregr.fit_a <- lm(mpg ~ weight +displacement +year +origin,
                data=Auto)
iregr.fit_a1 <- lm(mpg ~ weight*displacement +year +origin,
                data=Auto)
iregr.fit_a2 <- lm(mpg ~ weight +displacement +year*origin,
                 data=Auto)

summary(iregr.fit_a)
summary(iregr.fit_a1)
summary(iregr.fit_a2)
```

```{r}
iregr.fit_b <- lm(mpg ~ horsepower +cylinders +weight +displacement,
                  data=Auto)
iregr.fit_b1 <- lm(mpg ~ horsepower*cylinders +weight +displacement,
                   data=Auto)
iregr.fit_b2 <- lm(mpg ~ horsepower +cylinders +weight*displacement,
                   data=Auto)

summary(iregr.fit_b)
summary(iregr.fit_b1)
summary(iregr.fit_b2)
```

#### 9f

Trying log, square root and quadratic transformations for horsepower.

```{r}
regr.fit_poly2 <- lm(mpg ~ horsepower +I(horsepower^2)
                     +weight +year,data=Auto)
regr.fit_poly3 <- lm(mpg ~ poly(horsepower, 3) # using poly includes 1,2,3 deg
                     +weight +year, data=Auto) 
regr.fit_log <- lm(mpg ~ horsepower +I(log(horsepower)) 
                   +weight +year, data=Auto) 
regr.fit_sqrt <- lm(mpg ~ horsepower +I(sqrt(horsepower))
                    +weight +year, data=Auto)

summary(regr.fit_poly2)
summary(regr.fit_poly3)
summary(regr.fit_sqrt)
summary(regr.fit_log)
```

# 10

```{r}
attach(Carseats)
head(Carseats)
```

```{r}
fit.lm <- lm(Sales ~ Price +Urban +US,
             data=Carseats)
summary(fit.lm)
```

Parameters
* Sales: sales in thousands at each location 
* Price: price charged for car seats at each location 
* Urban: No/Yes by location 
* US: No/Yes by location

Coefficient comments

* Price (-0.054459): Sales fall by about 54 units for each dollar increase in Price - statistically significant
* UrbanYes (-0.021916): Sales are about 22 units lower at Urban locations - not statistically significant
* USYes (1.200573): Sales are about 1,201 units higher at US locations - statistically significant

#### 10b

Sales = 13.043 - 0.054 x Price - 0.022 x UrbanYes + 1.201 x USYes
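To see the 0/1 dummy coding R uses behind UrbanYes and USYes, model.matrix() can be inspected on toy data with the same factor structure:

```{r}
d <- data.frame(Urban = factor(c("No", "Yes")), US = factor(c("Yes", "No")))
model.matrix(~ Urban + US, d)  # columns UrbanYes and USYes are 0/1 dummies
```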

#### 10c

We can reject the null hypothesis for Price and USYes (low p-values).

#### 10d

```{r}
fit.lm1 <- lm(Sales ~ Price +US,
              data=Carseats)
summary(fit.lm1)
```

```{r message=TRUE, warning=TRUE}
# install.packages("plotly")
library(plotly)

# US is already a Yes/No factor in Carseats, so no recoding is needed

p <- plot_ly(Carseats, x = ~Price, y = ~Advertising, z = ~Sales, color = ~US,
             colors = c('#BF382A', '#0C4B8E')) %>%
  
  add_markers() %>%
  
  layout(scene = list(xaxis = list(title = 'Price'),
                     yaxis = list(title = 'Advertising'),
                     zaxis = list(title = 'Sales')))

p
```



#### 10f

* fit.lm (Price, Urban, US): RSE = 2.472 R^2 = 0.2393
* fit.lm1 (Price, US): RSE = 2.469 R^2 = 0.2393

fit.lm1 has a slightly better (lower) RSE value and one fewer predictor variable.
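Nested models like these can also be compared formally with a partial F-test via anova(); a sketch on simulated data of the same shape (hypothetical names, not the actual Carseats fits):

```{r}
set.seed(1)
d <- data.frame(price = rnorm(200),
                urban = rbinom(200, 1, 0.5),
                us    = rbinom(200, 1, 0.5))
d$sales <- 13 - 0.05*d$price + 1.2*d$us + rnorm(200)

small <- lm(sales ~ price + us, data = d)
full  <- lm(sales ~ price + urban + us, data = d)
anova(small, full)  # tests whether the dropped predictor adds anything
```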

#### 10g

```{r}
confint(fit.lm1)
```

#### 10h
```{r}
par(mfrow=c(2,2))
# residuals v fitted plot doesn't show strong outliers
plot(fit.lm1) 
```




























